blue_white_red.png
    generated with pymol spectrumbar script in the scripts dir

blue_white_red_colorbar.{png,svg}
    above with added scale annotation. Used for creating composite
    images of .pdb models

AFDBv2_PDB_daliDigest.tgz:

    Note: This file at ~13GB was too large to deposit with this data set. Below
    is an outline of the procedure used to generate it.

    I downloaded the alphafold2 digest from the dali authors at
    http://ekhidna2.biocenter.helsinki.fi/dali/AF-Digest.tar.gz
    on July 22, 2022

    This corresponds to alfafolddb v2 which includes swissprot However it
    caused dali to fail and return unexpected results. Further investigation
    suggested that empty DAT files may be the issue. Here is how I created the
    fixed database

    1) unpack to a scratch directory

    ```
    $ wd=$PWD
    $ cd /path/to/scratch
    $ mkdir af
    $ tar -C af -xzf $wd/AF-Digest.tar.gz
    ```

    2) remove empty AFDB dat files and remove the empty identifiers from list
    I'm guessing this is similar to my experience creating the original dali
    alphafold index manually - there is just some number of failures. And it
    looks like they were not removed here. They seemed to be mostly short
    peptides and all from the v2 part of the database

    ```
    $ find af/DAT -empty -printf '.' | wc -c
    2368
    $ find af/DAT -empty | sed 's:af/DAT/:: ; s:\.dat$::' > bad_ids
    $ find af/DAT -empty -exec rm {} +
    ```

    There were also some with more than 200 structural elements even though they
    were not supposed to be there - in the "digest" they are supposed to
    be split into 2

    ```
    $ find af/DAT -exec awk 'NR==1 && $4 > 200 {print $2} {exit}' {} +
    o43kA
    ```
    This was a file that was split (there is a o43kB) but for some reason
    205 structural elements were left behind in the first pseudochain. Also
    needs to be removed since dali cannot process anything with 200 or more
    structural elements

    ```
    echo -e "o43kA\no43kB" >> bad_ids
    rm -f af/DAT/o43k{A,B}.dat
    ```
    
    Clean up
    ```
    $ cat <<__EOF__ > clean.py
    #! /usr/bin/env python3
    import sys
    from Bio import SeqIO

    # read the bad identifiers
    bad = []
    with open(sys.argv[1], encoding="ascii") as fh:
        for line in fh:
            bad.append(line.strip())
    sbad = set(bad)


    with open(sys.argv[2], encoding="ascii") as fh:
        if sys.argv[2].endswith("list"):
            print(f"cleaning a list file: {sys.argv[2]}", file=sys.stderr)
            n_rem = 0
            for line in fh:
                if line.split()[0] in sbad:
                    n_rem += 1
                    continue
                print(line, end="")
        else:
            print(f"cleaning a fasta file: {sys.argv[2]}", file=sys.stderr)
            n_rem = 0
            for rec in SeqIO.parse(fh, "fasta"):
                if rec.id in sbad:
                    n_rem += 1
                    continue
                SeqIO.write(rec, sys.stdout, "fasta")
    print(f"removed {n_rem} lines", file=sys.stderr)
    __EOF__
    $ chmod +x clean.py

    $ for f in af/Digest/* ; do
        ./clean.py empty_ids $f > $(basename $f)
        mv $f ${f}.orig
        mv $(basename $f) af/Digest
      done
    ```

    removed the .orig files after some error checking

    3) create blast db

    ```
    $ module load blast
    $ makeblastdb -in af/Digest/AFDB2.fasta -dbtype prot
    ```

    4) import PDB. Even though PDB and the fake alphafolddb ids dali makes up
    should not overlap i put it into a separate folder to keep things clean
    --pdblist did not work.

    ```
    $ mkdir -p pdb/{DAT,DAT_tmp,Digest}
    $ find /fdb/pdb/ -name 'pdb*.ent.gz' > filelist
    $ wc -l filelist
    190098
    $ cd pdb
    $ cat <<'__EOF__' > mkpdb.sh
    n=0
    failed=0
    total=$(wc -l < filelist)

    err=$PWD/errors.csv
    errlog=$PWD/errors.log
    ok=$PWD/imported.csv
    dat=$PWD/DAT
    while read -r pdb ; do
        ## write dat file(s) to temp dir in case something goes wrong
        ## --clean remove the temporary .dssp file
        pdbid="$(basename "$pdb" .ent.gz | cut -c4-7)"
        import_ok=true
        tmp=$(mktemp -d ./XXXXXX)
        pushd "$tmp" &> /dev/null || exit 1
        mkdir DAT
        if ! import.pl --pdbfile "$pdb" --pdbid "$pdbid" &> log ; then
            ((failed++))
            printf "%s,%s,ERROR: import.pl returned a non-zero exit code\n" "$pdbid" "$pdb" >> "$err"
            import_ok=false
        fi

        ## did it produce output? Sometimes import.pl creates no .dat files
        nout=$(find DAT -type f -printf '.' | wc -c)
        if [[ $nout -eq 0 && $import_ok ]] ; then
            ((failed++))
            import_ok=false
            printf "%s,%s,ERROR: import.pl did not produce output\n" "$pdbid" "$pdb" >> "err"
        fi

        if ! $import_ok ; then
            printf -- "-------------- %s --------------\n" "$pdbid" >> "$errlog"
            cat log >> "$errlog"
        else
            # dali is limited to models with no more than 200 secondary structure elements
            # retain any chains that have <= 200 elements

            for datfile in DAT/*.dat ; do
                nsse=$(awk 'NR==1 && /^>>>>/ {print $4; exit}' "$datfile")
                chain="$(basename "$datfile" .dat)"
                if [[ $nsse -gt 200 ]] ; then
                    printf "%s,%s,%s,nsse[%i] > 200\n" "$chain" "$pdbid" "$pdb" $nsse >> "$err"
                else
                    mv "$datfile" "$dat"
                    printf "%s,%s,%s\n" "$chain" "$pdbid" "$pdb" >> "$ok"
                fi
            done
        fi
        popd &> /dev/null || exit 1
        rm -rf "$tmp"
        ((n++))
        printf "\r### processing [%4i / %4i] failed: %4i" $n $total $failed
    done < filelist
    printf "\n\n"
    __EOF__

    $ module load dali
    $ bash mkpdb.sh
    ### remove chains with 0 residues (e.g. RNA in ribosome)
    $ cat <<__EOF__ > clean_pdb.py
    import sys
    import os
    from Bio import SeqIO

    ### go through ids and find chains that did not include any aa residues
    with open("imported.csv", encoding="ascii") as fh:
        for line in fh:
            chain, pdbid, file = line.strip().split(",")
            with open(f"DAT/{chain}.dat", encoding="ascii") as datfh:
                header = datfh.readline().strip().split()
            if header[2] == "0":
                print(f"deleting {chain} - no aa residues", file=sys.stderr)
                os.remove(f"DAT/{chain}.dat")
            else:
                print(f"{chain},{pdbid},{file}")
    __EOF__
    $ python3 clean_pdb.py > imported-clean.csv
    $ awk -F',' '{print $1}' imported-clean.csv > Digest/PDB.list
    $ dat2fasta.pl "DAT" < Digest/PDB.list > Digest/PDB.fasta
    $ makeblastdb -in Digest/PDB.fasta -dbtype prot
    $ ml cd-hit
    $ cd-hit -T $SLURM_CPUS_PER_TASK -M $(( SLURM_MEM_PER_NODE - 1000 )) \
          -i Digest/PDB.fasta -c 0.7 -o Digest/PDB_70.fasta
    $ grep '^>' Digest/PDB_70.fasta | perl -pe 's/^>//' > Digest/PDB_70.list
    $ grep '^>' Digest/PDB.fasta | perl -pe 's/^>//' | tr -s ' ' ' ' > Digest/PDB.list
    ```

    Kept separate copies of the relevant .list files for easier annotation.

    4) packed it back up and gave it a better name. also contains
       the extra af level

    ```
    $ tar -cf - af pdb | pigz -p 4 > $wd/AFDBv2_PDB_daliDigest.tgz
